# T-725 Natural Language Processing: Lab 1
In these labs, we will be using the [Python 3](https://www.python.org/) programming language and the [Natural Language Toolkit (NLTK)](https://www.nltk.org/). We will also be using Google Colab, a free service hosted by Google, which gives us access to a Linux machine that comes pre-installed with Python 3 and the NLTK.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Using Google Colab
Google Colab allows users to work with "notebooks", which consist of text cells and code cells. Text cells can be edited by double-clicking them. A code cell can be executed by selecting it and pressing `Ctrl + Enter`. Code is shared between cells, meaning that you can, for example, create a variable in one cell and use it in another cell later on.

To begin with, do the following:
* Select `"File" > "Save a copy in Drive"` to create your own copy of this notebook that you can edit.
* Select `"Runtime" > "Run all"` to run all of the code cells in the notebook.

## Resources
* [The Python Standard Library](https://docs.python.org/3/library/index.html) - an overview of the built-in libraries in Python with a lot of examples.
* [The Python Tutorial](https://docs.python.org/3/tutorial/index.html) - an official tutorial that gives a brief overview of the language.
* [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/) - a free book that offers a good introduction to the Python programming language to beginners.
* [Natural Language Processing with Python](http://www.nltk.org/book/) - a free companion book to the NLTK toolkit.

## Setting Python and the NLTK up on your own machine
* [Python 3](https://realpython.com/installing-python/) - installation instructions
* [NLTK](https://www.nltk.org/install.html) - installation instructions

## String methods in Python
There are many ways to manipulate strings in Python. A full list of methods for the String class may be found in the [library reference](https://docs.python.org/3/library/stdtypes.html#string-methods).

In [None]:
a_string = "It was the best of times, it was the worst of times"

print("Lowercase:", a_string.lower())
print("'times' count:", a_string.count('times'))
print("First occurrence of 'best':", a_string.find('best'))

## Lists, sets and built-in functions
Lists and sets are two kinds of collections that can be used in Python.
* A **list** is an *ordered* sequence of elements. Lists are enclosed with square brackets, and the elements are separated with a comma (e.g., `a_list = ["This", "is", "a", "list"]`).
* A **set** is a collection of *unordered* and *unique* elements (meaning that it contains no duplicates). Sets are enclosed by curly braces, and the elements are separated by commas (e.g., `a_set = {"This", "is", "a", "set"}`).
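The difference between the two is easiest to see with a collection that contains duplicates. The following is a minimal sketch; the sample data and variable names are made up for illustration and are not part of the lab.

```python
# Hypothetical sample data, used only to illustrate the difference
sample_list = ["to", "be", "or", "not", "to", "be"]
sample_set = set(sample_list)

# A list keeps every element, in the order it was added
print("List length:", len(sample_list))  # 6
print("First element:", sample_list[0])  # to

# A set keeps only one copy of each element and has no defined order
print("Set length:", len(sample_set))  # 4
print("'or' in set:", "or" in sample_set)  # True

# One way to remove duplicates while keeping the original order
unique_in_order = list(dict.fromkeys(sample_list))
print("Unique, in order:", unique_in_order)
```

Because sets are unordered, they cannot be indexed the way lists can (e.g., `sample_set[0]` raises a `TypeError`).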

In [None]:
# Strings (and other iterables) can be converted to lists and sets with the list() and set() functions
char_list = list(a_string)
char_set = set(a_string)

# You can also split strings on certain characters to create a list of strings
words = a_string.split()  # Splits on whitespaces by default
print("Split string:", words)

# The len() and max() built-in functions
print("Unique characters:", char_set)
print("No. of unique characters:", len(char_set))
print("Longest word:", max(words, key=len))

You can use the built-in function `help()` to quickly access documentation for a given object.

In [None]:
help(max)

## Indices and slicing
You can get characters and substrings at specific indexes:

In [None]:
print("String:", a_string)
print("First character:", a_string[0])  # Indices in Python start at 0
print("Last character:", a_string[-1])  # A negative index starts counting from the end

# We can get ranges of elements by slicing a sequence (here, a string)
print("Characters 11 to 14:", a_string[11:15])
print("First 6 characters:", a_string[:6])
print("Last 5 characters:", a_string[-5:])
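Slicing works the same way on lists as it does on strings, and a slice can take an optional third value, the step. Here is a short sketch using a made-up list of words (the name `sample_words` is only for illustration):

```python
# Hypothetical example list, created by splitting a short string
sample_words = "It was the best of times".split()

print("First 3 words:", sample_words[:3])      # ['It', 'was', 'the']
print("Last 2 words:", sample_words[-2:])      # ['of', 'times']
print("Words 1 to 3:", sample_words[1:4])      # ['was', 'the', 'best']

# The third value in a slice is the step
print("Every other word:", sample_words[::2])  # ['It', 'the', 'of']
print("Reversed:", sample_words[::-1])
```

Note that the end index of a slice is exclusive, so `sample_words[1:4]` returns the elements at indices 1, 2 and 3.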

## The Natural Language Toolkit (NLTK)
The NLTK book, [Natural Language Processing with Python](http://www.nltk.org/book/), is an introduction to natural language processing in Python, using the NLTK library. [Chapter 1](http://www.nltk.org/book/ch01.html) is relevant to this lab. The NLTK comes with a lot of data, such as corpora and trained models. We can download this data with the `nltk.download()` function.

In [None]:
import nltk
from nltk.corpus import gutenberg

# Download the 'gutenberg' corpus, which is a collection of books in the public domain
nltk.download('gutenberg')

# Get a plain-text version of Moby Dick (contained in a single string)
moby_raw = gutenberg.raw('melville-moby_dick.txt')

# Get all the tokens in Moby Dick (as a list of strings)
moby_tokens = gutenberg.words('melville-moby_dick.txt')

# Print the first 10 tokens of Moby Dick
print("First 10 tokens:", moby_tokens[:10])

# Print the first 250 characters of Moby Dick
print("\nFirst 250 characters:\n", moby_raw[:250])

NLTK includes a Text class for analyzing the contents of texts. Let's print a concordance for the word *Iceland*:

In [None]:
moby_text = nltk.Text(moby_tokens)
moby_text.concordance('Iceland')

NLTK offers several ways to segment text into sentences and tokenize it. Let's see how `nltk.sent_tokenize()` and `nltk.word_tokenize()` work:

In [None]:
# Download the Punkt tokenizer model
nltk.download('punkt')

moby_sentences = nltk.sent_tokenize(moby_raw)  # Split raw text into sentences
tokens = nltk.word_tokenize(moby_sentences[3])  # Split a string into tokens

print("First 5 sentences:")
for sentence in moby_sentences[:5]:
    print(">>>", sentence)

print(f"\nTotal number of sentences: {len(moby_sentences):,}")

print("\nTokens:", tokens)

## Regular expressions
The Python standard library includes the `re` module for handling regular expressions ([reference](https://docs.python.org/3/library/re.html)). In Python, regular expression patterns should be created using *raw* strings, which are prefixed with an `r`:

In [None]:
import re
re.findall(r'\b\S{9,}est\b', moby_raw)
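The reason for using raw strings is that Python processes backslash escapes in ordinary strings before the `re` module ever sees the pattern. For instance, `'\b'` in an ordinary string is a single backspace character, whereas `r'\b'` is a backslash followed by a `b`, which `re` interprets as a word boundary. A small sketch with a made-up sample string:

```python
import re

# Hypothetical sample string for illustration
sample = "the cat sat"

# In an ordinary string, '\b' is one character (backspace)
print(len('\b'), len(r'\b'))  # 1 2

# Without the r prefix, the pattern looks for backspace characters around "cat"
without_raw = re.findall('\bcat\b', sample)
print("Without raw string:", without_raw)  # []

# With the r prefix, \b reaches the re module and means "word boundary"
with_raw = re.findall(r'\bcat\b', sample)
print("With raw string:", with_raw)  # ['cat']
```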

You can capture text that matches a specific part of a pattern in a group by enclosing it within parentheses. When making substitutions with `re.sub()`, you can refer to these groups using `\number` (e.g., `\1` and `\2`), where the number refers to their position in the pattern:

In [None]:
another_string = "The grapes of wrath"
re.sub(r'(\S+) of (\S+)', r'\2 of \1', another_string)
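Groups can also be read directly from a match, which can help when checking what each group captured before writing a substitution. The sketch below reuses the same pattern with `re.search()` and `re.findall()`; the variable names are only for illustration.

```python
import re

title = "The grapes of wrath"
title_match = re.search(r'(\S+) of (\S+)', title)

# group(0) is the whole match; group(1) and group(2) are the captured groups
print("Whole match:", title_match.group(0))  # grapes of wrath
print("Group 1:", title_match.group(1))      # grapes
print("Group 2:", title_match.group(2))      # wrath

# When a pattern contains groups, re.findall() returns only the groups
group_pairs = re.findall(r'(\S+) of (\S+)', title)
print("findall:", group_pairs)  # [('grapes', 'wrath')]
```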

NLTK offers a simple way of searching for sequences of tokens using regular expressions, where the pattern for each token is enclosed in angle brackets:

In [None]:
# First we must create a Text object from a list of tokens
moby_text = nltk.Text(moby_tokens)

# Let's search for sequences of four or more tokens which all begin with the letter S
moby_text.findall(r'<[Ss].*>{4,}')
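To get a feel for how the angle-bracket notation can work, the following is a rough sketch of the idea using only the `re` module. It is not NLTK's own code: `token_findall` is a hypothetical helper written for this illustration, and it assumes that no token contains an angle bracket.

```python
import re

def token_findall(tokens, pattern):
    """Hypothetical helper: search a list of tokens with an angle-bracket pattern."""
    # Join the tokens into one string, wrapping each token in angle brackets
    raw = "".join(f"<{token}>" for token in tokens)
    # Turn each <...> into a non-capturing group so quantifiers apply to whole tokens
    pattern = pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    # Stop an unescaped '.' from matching across a token boundary
    pattern = re.sub(r"(?<!\\)\.", "[^>]", pattern)
    # Strip the outer brackets of each hit and split it back into tokens
    return [hit[1:-1].split("><") for hit in re.findall(pattern, raw)]

# Made-up token list for illustration
sample_tokens = ["She", "sells", "sea", "shells", "by", "the", "sea", "shore"]
s_runs = token_findall(sample_tokens, r'<[Ss].*>{4,}')
print(s_runs)  # [['She', 'sells', 'sea', 'shells']]
```

In this sketch, `{4,}` applies to a whole bracketed token rather than to a single character, which is why the pattern finds runs of at least four tokens.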

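You can use groups to target specific tokens. When the pattern contains a group in parentheses, only the tokens matched inside the group are printed: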

In [None]:
moby_text.findall(r'<[Aa]n?>(<.+>)<ship>')

# Assignment
Answer the following questions and hand in your solution in Canvas before midnight tonight, September 1st. Make a copy of this notebook (`"File" > "Save a copy in Drive"`) and enter your solutions in the cells below each question. Remember to save your file before uploading it.

### Question 1
Get the raw text of `carroll-alice.txt` (Alice in Wonderland) from the Gutenberg corpus in NLTK and tokenize it using `nltk.word_tokenize()`.

1. How many tokens does it contain in total?
2. How many unique tokens does it contain?

In [None]:
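# answer: carroll-alice.txt contains 33494 tokens
import nltk
from nltk.corpus import gutenberg
nltk.download('gutenberg')

tokens = nltk.tokenize.word_tokenize(gutenberg.raw('carroll-alice.txt'))
print("Total number of tokens:", len(tokens))
print("Number of unique tokens:", len(set(tokens)))  # A set keeps one copy of each token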

### Question 2
Use `nltk.FreqDist()` to create a frequency distribution of all the tokens in Alice in Wonderland. What are the 20 most frequently occurring tokens?

In [None]:
# Your solution here
import nltk
nltk.FreqDist(tokens).most_common(20)

### Question 3
Use `nltk.sent_tokenize()` to segment Alice in Wonderland into sentences, then find the longest sentence in the book.

In [None]:
# Your solution here
sentences = nltk.sent_tokenize(gutenberg.raw('carroll-alice.txt'))
longest = max(sentences, key=len)
print("Longest sentence:", longest)
print("\nLength:", len(longest))

### Question 4
Use a regular expression to find all tokens in Alice in Wonderland that contain an *x* and end with *ed*.

In [None]:
# Your solution here
import re

# \w* (rather than \w+) before the x also allows words that start with x
tokens_2 = re.findall(r'\b\w*x\w*ed\b', gutenberg.raw('carroll-alice.txt'))
print(tokens_2)

### Question 5
Use `re.sub()` to "dehyphenate" the following string:

>It is a capital mistake to theo-  
>rize before one has data. Insen-  
>sibly one begins to twist facts  
>to suit theories, instead of the-  
>ories to suit facts.

You will need to use groups to recombine the words. The resulting string should look like this:

>It is a capital mistake to  
>theorize before one has data.  
>Insensibly one begins to twist facts  
>to suit theories, instead of  
>theories to suit facts.

Remember that a "newline" character is represented by `\n` in strings and regular expression patterns.

In [None]:
# Your solution here
hyphenated = """
It is a capital mistake to theo-
rize before one has data. Insen-
sibly one begins to twist facts
to suit theories, instead of the-
ories to suit facts.
"""

dehyphenated = re.sub(r'(\w+)-\n(\w+)', r'\n\1\2', hyphenated)
print(dehyphenated)

## UNIX Text Processing Tools
Solve the following questions using UNIX tools for text processing. Use the resources in the "Text Processing Tools" module on Canvas for this part of the lab.

Download `lab1.txt` from the Testfiles page in that module.

To run UNIX commands in Colab, start the line with `!`.

Upload the necessary files into Colab by clicking the folder icon in the left-hand navigation bar and then the upload icon (a document with an up arrow).

### Question 6

Use egrep to match all lines that make reference to a decade.

In [None]:
# Your solution here

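# Matches decades written like 1980s or 1980's.
# Braces are doubled because IPython expands single braces in ! commands as Python expressions.
!egrep "\b[1-9][0-9]{{2}}0'?s\b" lab1.txt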

### Question 7

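Use sed to replace all occurrences of years or decades with the string "CLASSIFIED". Note that in sed's default (basic) regular expression syntax, special symbols like + and ? must be escaped using the \ character; with the `-E` option (extended syntax), they are written unescaped.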

In [None]:
# Your solution here

!sed -E 's/[0-9]{{4}}s?/CLASSIFIED/g' lab1.txt


### Question 8

Translate all uppercase letters to lowercase using `tr '[:upper:]' '[:lower:]' < inputfile`

Then use `sed` to replace all non-alphanumeric characters with a newline, and remove all empty lines. Store the output in a file called `tokens.txt`.

In [None]:
# Your solution here

!tr '[:upper:]' '[:lower:]' < lab1.txt | sed -E 's/[^a-z0-9]/\n/g' | sed '/^\s*$/d' > tokens.txt

### Question 9

Create a unigram model of `tokens.txt` using UNIX commands.

In [None]:
# Your solution here

!cat tokens.txt | sort | uniq -c | sort -nr > unigram.txt