# Combined Labs 6 & 7, Python for Text, Fall 2020
## Lab 6 topics: Structuring programs, error handling (sections 1-2)
## Lab 7 topics: Preprocessing with spaCy (sections 3-5)

---
To be completed and submitted by end of the day on Monday, October 19. 

The questions you need to answer are marked with **QUESTION**. For each one, there's a space (under **ANSWER**) for you to add your answer, which might be text, might be code, or might be a mix of the two. 



### Have questions?

1. Use the Canvas discussion boards.

## Reminder

You will need to submit your own notebook (or rather, a link to your own notebook). There are (at least) two ways to do this:

1. Make a copy of this notebook, rename it, and add new code cells when you want to write your own code.
1. Create a new Python 3 notebook using the `File` menu and type in all cells yourself. 

Either way, your file should be named using this format: `LastName_Py4Text_Lab67.ipynb`

# Overview: Working with scholarly/scientific data

---

{this is a long lab - don't wait to get started!}

This week's lab covers several important topics:

1. Working with scholarly/scientific data,
2. Some methods for structuring and testing the Python code you write, and 
3. Preprocessing of textual data.

The coding and debugging strategies you'll learn will be useful whenever you are programming. Preprocessing is an essential part of working with textual data. In essence, preprocessing helps us deal with some of the messiness that is part of text (and language). This week you'll learn about preprocessing with the example of scholarly data, but the preprocessing steps you'll learn about are *not* specific to scholarly data - they apply for just about any kind of written text.


Things you should be able to do at the end of this lab:

* Understand what a **scientific abstract** is
* Get familiar with one small corpus of scientific abstracts
* Write **pseudocode**
* Structure programs by chaining functions together
* Use print statements for **debugging**
* Use **`assert`** for debugging
* Use **`try/except`** for error handling
* Define a **class** and create **objects** from that class
* Understand the role and function of these **preprocessing** steps for textual data: **tokenization**, **sentence splitting**, **lemmatization**, and **part-of-speech (POS) tagging**
* Use **spaCy** to perform preprocessing 



# 1. Scientific abstracts as textual data

In the previous labs, we worked with news data and then with the texts of novels (or other fictional works). This week, we turn to scholarly data, and specifically to research articles - scientific publications. Research articles are one of the main sources of new knowledge in science. This is where researchers present their experimental work and their new results, across many different disciplines. Research papers are also extremely important in the humanities and social sciences. Such papers can be published in journals, proceedings of conferences, scholarly archives, and even on people's websites. 

Though the structure of the typical scientific publication changes somewhat from one academic discipline to another, nearly all disciplines include the **abstract** as part of that structure. An abstract is essentially a short summary of the article. In a very small amount of text - usually one or two paragraphs - an abstract:

* explains why the research is important,
* mentions the main methods used in the work, and
* outlines the most important conclusions and results.

Some people say that the abstract should convince a potential reader that it's worthwhile to read the rest of the paper. So, abstracts can give us a pretty good idea what a paper is about. In many fields, the number of articles published every year is *way* more than one individual can read and keep up with, and abstracts can be very helpful in deciding which papers to read, and then which to read first. We can also imagine using automatic methods for scanning abstracts in order to stay on top of the flood of scientific literature in a field.

**READING:** To learn more about some of the potential uses of abstracts, read the first two and a half pages (through the end of section 3) of this paper: http://clg.wlv.ac.uk/papers/orasan-01b.pdf

Title: Patterns in Scientific Abstracts <br>
Author: Constantin Orasan, University of Wolverhampton <br>
In: *Proceedings of the 2001 Corpus Linguistics Conference* <br>

The rest of the paper provides a nice example of a corpus analysis, as Orasan describes in great detail the corpus he has collected and analyzed. If you think you might like to do a **corpus analysis** for your final project, this paper is a good model to follow.

### **QUESTION:**

> How many abstracts are included in the corpus Orasan analyzes? Which field has the most abstracts? Which has the fewest? 

### **ANSWER:**

> 917 total abstracts with the field of Artificial Intelligence having the most files (512). Both Anthropology and Linguistics have the fewest amount with 50 files each. 

-----




## Datasets

In addition to being sources of new knowledge, academic articles are themselves widely used as the subject of other scientific studies, with the result that many different corpora of academic articles have been assembled.

This webpage gives a nice overview of different corpora and their properties (including their availability): https://www.clarin.eu/resource-families/corpora-academic-texts

For this week's lab, we'll be using mini-corpora that I have created by taking subsets of two existing corpora. One is a corpus of abstracts, and the other is a corpus of full-length scientific articles. In both cases, the texts have been transformed into plain text format, making it much easier to read into a computer program - imagine trying to deal with a bunch of pdfs in Python!

Our first mini-corpus is a subset of the **SciDTB** corpus. This is a collection of abstracts which have been annotated with discourse structure and further analyzed for the function of each sentence. We're not using the annotated version, though, just the plain text. If you're interested in reading more about the corpus and how it has been used in research, see this paper (*not* a required reading): https://www.aclweb.org/anthology/P18-2071/

The second mini-corpus is a subset of the **ACL-ARC** corpus. This is a collection of computational linguistics papers. I've randomly selected 100 papers from the original corpus, which consists of more than 10,000 papers. We're using a version in which the original pdfs have been converted to plain text, and figures and tables have either been removed or transformed into a machine-friendly format. In addition, all special characters have been removed, so that we only have ASCII characters (we'll talk about encoding near the end of the semester - if you're interested in learning more in the meantime, here's Wikipedia on ASCII: https://en.wikipedia.org/wiki/ASCII). Here's the paper describing the corpus in more detail (optional reading): http://www.lrec-conf.org/proceedings/lrec2008/pdf/445_paper.pdf

Take a look at the webpage for the ACL-ARC corpus, where you can find a description of the different ways the data has been cleaned up: https://web.eecs.umich.edu/~lahiri/acl_arc.html

### **QUESTION:**

> Describe two kinds of data cleaning done by the corpus creators.

### **ANSWER**

> The corpus creators have removed headers from papers (this does not include the abstracts) as well as the footers (which does include the references).

-----




## Accessing the data

For this week's lab, I have created two mini-corpora for you to explore. Both are subsets of larger corpora (described in Section 1), and in both cases I have created a compressed zip file containing the texts: 

* the `sciDTB` mini-corpus contains 153 plain-text abstracts from research papers in computational linguistics
* the `arcSmall` mini-corpus contains 100 plain-text versions of full-length research papers in computational linguistics

You can fetch these using the following commands - after running this code block, the `.zip` files should show up in your list of Colab files (click the folder icon to the left).

In [None]:
### download sciDTB zip file
!wget https://github.com/alexispalmer/py4text/raw/main/data/SciDTB-mini.zip

### download arcSmall zip file
!wget https://github.com/alexispalmer/py4text/raw/main/data/aclArc-mini.zip

Once you have downloaded the zip files, you need to unzip them to have access to the files. 

SIDE NOTE: You'll notice that both of the lines below, and also the `wget` commands we used to download the zip files, start with `!`. We can use this to run commands that we would normally run on the command line (specifically, in a `bash` shell).

In [None]:
!unzip SciDTB-mini.zip
!unzip aclArc-mini.zip

Now refresh your folders listing (click on the folder with the round arrow, above the list of file and folder names), and you can see two new folders, one for each mini-corpus. Click on one of these to see the list of filenames, and then double-click on one of those to inspect the contents of the file.

### **QUESTION:**

> Look at (at least) one file from each mini-corpus. You'll see that each mini-corpus has a different format for its files. What differences do you see between the two?

### **ANSWER:**

> The first difference I see between the two files is the structure of the content itself. For the txt files within the SciDTB folder, the entire text is written on one line, while the text files in the aclArc folder are broken up into sections (Abstract, Introduction, etc.) across multiple lines.

---

# 2a. Planning and structuring code

Next, we want to access some of the individual texts from our mini-corpora. Let's focus on the SciDTB corpus for now. The first thing we need to do is make a list of all of the filenames in the corpus. Since we're no longer in the safety of the NLTK, we can't count on our trusty `fileids()` method. Instead, we're going to make use of a module in Python called `os` - this stands for **operating system**. The `os` module gives us a way of running commands that we would normally run outside of Python (very much like the `!` prefix mentioned above). 

One of the really useful things we can do with the `os` module is to get a list of all filenames in a directory. We'll break this process into two steps:

1. Create a variable for the directory name (stored as a string).
1. Use an `os` command to get a list of all files in the directory (each of these will also be a string, just like what we get from the `fileids()` method).

Once we have that list, we can iterate through it using a `for`-loop, to process each file, one at a time.

In [None]:
### First we need to import the os module
import os

### Create a variable with the directory name
sciDir = 'SciDTB-test'

### Get a list of all the filenames in that directory
sciFiles = os.listdir(sciDir)
print(sciFiles[:10])   ## it's a long list, so we'll just print the first few

['D14-1020.edu.out', 'P16-1131.edu.out', 'D14-1189.edu.out', 'P14-1116.edu.out', 'D14-1175.edu.out', 'D14-1018.edu.out', 'D14-1005.edu.out', 'D14-1003.edu.out', 'D14-1023.edu.out', 'P16-1145.edu.out']
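
To open any of these files, each filename has to be combined with its directory name. Here is a minimal sketch of one way to build those paths with `os.path.join`. It uses a throwaway temporary directory and made-up filenames (`demoDir`, `a.txt`, and `b.txt` are hypothetical and not part of the lab data):

```python
import os
import tempfile

## Hypothetical stand-in for a corpus directory: a temporary folder
## holding two tiny files (not the real SciDTB data)
demoDir = tempfile.mkdtemp()
for name in ['b.txt', 'a.txt']:
    with open(os.path.join(demoDir, name), 'w') as f:
        f.write('some text')

## os.listdir() returns bare filenames in arbitrary order,
## so we sort them to get a predictable list
demoFiles = sorted(os.listdir(demoDir))
print(demoFiles)

## os.path.join() puts the right separator between directory and filename
demoPaths = [os.path.join(demoDir, name) for name in demoFiles]
print(demoPaths[0])
```

Joining the directory and filename with `"/"` also works on Colab. `os.path.join` is simply one option that picks the separator for you.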


### **REVIEW QUESTION:**

> Use Python code to figure out how many files are in our SciDTB mini-corpus. Show your code.

### **ANSWER:**

> 152 files, code shown below

### **QUESTION:**

> Now create a list named `arcFiles` with all of the filenames for the aclArc mini-corpus. Show your code.

### **ANSWER:**

> 100 files, code shown below

------

In [None]:
print(len(sciFiles))
arcDir = 'aclArc-mini'
arcFiles = os.listdir(arcDir)
print(len(arcFiles))

152
100


## Pseudocode

Now I want to introduce the concept of **pseudocode**. Pseudocode is often a very helpful step to be performed prior to writing your actual code. To write pseudocode, we first *think through* the steps that need to be performed in order to solve our problem, and then *write those steps out* in a kind of language that is somewhere between natural language and programming language. Usually, pseudocode is quite general and is not specific to any particular programming language. So the same pseudocode could be used to guide programmers using Python or Java or C# or JavaScript or or or ... and result in programs that do the same things, in all those different programming languages. 

We're going to try out the process of writing pseudocode to do some things with the files in our mini-corpus. First, though, watch this video explaining pseudocode in detail. In this video, the pseudocode is converted into a programming language, but using JavaScript rather than Python. Don't worry about the fact that the code looks entirely different - focus on the pseudocode.

* Codecademy on pseudocode (10 minutes): https://www.youtube.com/watch?v=PwGA4Lm8zuE

If you'd still like to hear more, here are two more videos about pseudocode

* 5 minutes to code (5 minutes):
https://www.youtube.com/watch?v=HhBrkpTqzqg

* British guy with cartoons (5 minutes):
https://www.youtube.com/watch?v=XDWw4Ltfy5w

### **QUESTION:**

> As an exercise, write Python code for the FizzBuzz problem (described in the Codecademy video). The pseudocode instructions are already copied into a code block below. HINT: to get a `for`-loop that will iterate exactly 20 times, use the `range()` function.

### **ANSWER:**

> code shown below:

In [None]:
#### The Problem:
## Write a program that prints the numbers from 1 to 20.
## For multiples of three print "Fizz" instead of the number
## For multiples of five print "Buzz" instead of the number
## For numbers which are multiples of both 3 and 5 print "FizzBuzz"
## For numbers not divisible by 3, or 5, or both, print the number as is

#### The pseudocode:

## FOR LOOP:
## For numbers from 1 to 20
   ## IF number MOD 15 == 0
       ## print 'FizzBuzz'
   ## ELSE IF number MOD 3 == 0
       ## print 'Fizz'
   ## ELSE IF number MOD 5 == 0
       ## print 'Buzz'
   ## ELSE
       ## print number

In [None]:
for n in range(1,21):
  if n % 15 == 0:
    print('FizzBuzz')
  elif n % 3 == 0:
    print('Fizz')
  elif n % 5 == 0:
    print('Buzz')
  else:
    print(n)

1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
19
Buzz


Now let's look at working with the texts in our mini-corpus -- one useful thing we could do is to build a frequency distribution for those texts. Before we just start typing, let's write some pseudocode.

### Step one: what are the high-level tasks? What is the problem?

The problem is to produce a frequency distribution from a set of plain texts. The high level steps:

* Read the texts in and convert each to a list of words
* Combine the texts into one list
* Create a frequency distribution from that list

### Step two: Now let's turn this into pseudocode:

In [None]:
## VARIABLES AND DATA STRUCTURES NEEDED:
## 1. list of the filenames of texts
## 2. list of words for the entire corpus

## FOR LOOP:
## For each text in the mini-corpus
    ## Read the text in as a string
    ## SPLIT the text into a list of words
    ## Add those words to the corpus word list
## Create FREQDIST from list of words
## Print 20 most common words

### Step three: Finally, turn this into Python code

In [None]:
import nltk
from nltk import FreqDist

## VARIABLES AND DATA STRUCTURES NEEDED:
## 1. list of the filenames of texts
    ### this we already have: sciFiles
## 2. list of words for the entire corpus

sciWords = []

## FOR LOOP:
## For each text in the mini-corpus
for text in sciFiles:
    ## Read the text in as a string
    ## We need to give the whole path: the directory name
    ##    plus the file name, separated by a slash
    thisText = sciDir+"/"+text
    print("now processing", thisText)
    with open(thisText, 'r') as f:
        thisTextString = f.read() ## read in as string
    
    ## SPLIT the text into a list of words
    thisTextWords = thisTextString.split()
    print("Current text word count:", len(thisTextWords))
    
    ## add a debugging step - we will comment these two lines out after
    ## we know we're happy with what the code is doing
    
    # print(thisTextWords)
    # input("okay to proceed? ")

    ## Add those words to the corpus word list
    ## Use extend instead of append, because what we're adding is a list
    ###   of strings, not just a single element
    sciWords.extend(thisTextWords)

### Check the number of files and number of words in the corpus
print()
print("The corpus has", len(sciFiles), "files.")
print("The corpus has", len(sciWords), "words.")

## Create FREQDIST from list of words
sciDist = FreqDist(sciWords)

## Print 20 most common words
print()
print("The 20 most common words are:")
sciDist.most_common(20)

now processing SciDTB-test/D14-1020.edu.out
Current text word count: 134
now processing SciDTB-test/P16-1131.edu.out
Current text word count: 91
now processing SciDTB-test/D14-1189.edu.out
Current text word count: 65
now processing SciDTB-test/P14-1116.edu.out
Current text word count: 157
now processing SciDTB-test/D14-1175.edu.out
Current text word count: 117
now processing SciDTB-test/D14-1018.edu.out
Current text word count: 147
now processing SciDTB-test/D14-1005.edu.out
Current text word count: 87
now processing SciDTB-test/D14-1003.edu.out
Current text word count: 126
now processing SciDTB-test/D14-1023.edu.out
Current text word count: 113
now processing SciDTB-test/P16-1145.edu.out
Current text word count: 132
now processing SciDTB-test/D14-1050.edu.out
Current text word count: 105
now processing SciDTB-test/P16-1102.edu.out
Current text word count: 211
now processing SciDTB-test/D14-1034.edu.out
Current text word count: 146
now processing SciDTB-test/P16-1139.edu.out
Current te

[('.', 816),
 ('the', 784),
 (',', 689),
 ('of', 575),
 ('a', 503),
 ('and', 491),
 ('to', 416),
 ('in', 274),
 ('that', 240),
 ('for', 223),
 ('on', 214),
 ('We', 200),
 ('is', 194),
 (')', 165),
 ('(', 163),
 ('with', 147),
 ('we', 137),
 ('model', 133),
 ('are', 114),
 ('this', 112)]

Stopwords strike again! We can see just one word which seems specific to the corpus: *model*. We'll remove the stopwords in the next section of the lab.

### **QUESTION:**

> Your turn to write pseudocode, but for a simpler problem this time. Write pseudocode for calculating the average text length (in number of words) in this corpus. You can assume that you already have a list of filenames, but start from there. You do not have to write the Python code.

### **ANSWER:**

> (your answers here)

-----

In [None]:
##VARIABLES NEEDED:
## 1. List of filenames 
    ## (already done, going to use arcFiles)
## 2. List of words for the corpus

##Create empty list for total words in the corpus

##FOR LOOP:
##For each text in corpus:
  ##Read in the text as a string
  ##Split text into list of words
  ##Extend list to previous defined variable

##Create variable for each word in the totalwords list
  ## w for w in totalwords
##Create average calculation variable:
  ##average word length = sum(length(word) for word in totalwords) / length(words)
  ##print statement of the average length of the words in the corpus

In [None]:
totalwords = []

for text in arcFiles:
  text2str = arcDir+"/"+text
  with open(text2str, 'rb') as f:
    newtext = f.read()

  words = newtext.split()
  totalwords.extend(words)

avgwords = [word for word in totalwords]
avg = sum(len(word) for word in avgwords) / len(avgwords)
print("The average word length in the corpus is:", avg)

The average word length in the corpus is: 5.198972054741825


### **BONUS:**

> (worth 5 extra points) Write the Python code for your average-text-length pseudocode.
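
Note that average *text* length (words per text) is a different quantity from average *word* length (characters per word). Below is a minimal sketch of the text-length calculation, assuming a tiny made-up corpus written to a temporary directory (`demoDir` and the three file names are hypothetical). To run it on the lab data, you would use the real directory and filename list instead.

```python
import os
import tempfile

## Hypothetical mini-corpus: three tiny texts in a temporary directory
demoDir = tempfile.mkdtemp()
demoTexts = {'one.txt': 'a b c', 'two.txt': 'a b c d e', 'three.txt': 'a'}
for name, content in demoTexts.items():
    with open(os.path.join(demoDir, name), 'w') as f:
        f.write(content)

## Keep one word count per text, rather than one long list of words
textLengths = []
for name in os.listdir(demoDir):
    with open(os.path.join(demoDir, name)) as f:
        words = f.read().split()
    textLengths.append(len(words))

## Average text length = total number of words / number of texts
avgLength = sum(textLengths) / len(textLengths)
print("Average text length (in words):", avgLength)
```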

## Chaining functions together

Next, let's look at how to combine functions into a bigger program. For both of the problems above, we used a very similar process of reading in a text and splitting it into words. Since this is something we may want to do often, let's define a function to do this. The function should take as input the path to the file (this could be just the filename, or the directory plus filename, as we have here), and as output it should return a list of words. The steps in the function are the same as what we did above, including print statements.

In [None]:
### function to split text into list of words
### input: filename (or path to file), as string
### return: list of words in the file

### optionally: uncomment the print statements for more output from the function

def wordList(textPath):
    #print()
    #print("Extracting words from:", textPath)
    with open(textPath) as f: 
        textString = f.read()
    
    textWords = textString.split()
    #print("Word count:", len(textWords))

    return textWords

Instead of writing this code inside our `for`-loop, we can simply call our new function, named `wordList`.

In [None]:
corpusWords = []

for filename in sciFiles:
    path = sciDir+"/"+filename
    print("Now processing:", path)
    fileWords = wordList(path)   ### calling function to get word list
    corpusWords.extend(fileWords)

print("Corpus word count:", len(corpusWords))

As we start to write code of greater complexity, we'll want to think in terms of combining functions. A common practice is that the output of one function might serve as input to another function. Let's see what this looks like, by writing another function to remove stopwords and punctuation from the word list produced by `wordList()`. The code for removing stopwords and punctuation should be familiar by now.

In [None]:
### necessary downloads and imports
nltk.download('stopwords')
from nltk.corpus import stopwords as sw
import string
enStops = sw.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
### function to remove stopwords and punctuation
### input: list of words
### returns: list of words with no stopwords and no punctuation

### optionally: uncomment print statement for more output from function

def cleanWordList(words):
    noStops = [w for w in words if w not in enStops]
    noPunct = [w for w in noStops if w not in string.punctuation]
    
    #print("New word count:", len(noPunct))

    return noPunct


We can add this cleaning step into the `for`-loop. Notice that the output of `wordList()` is the input to `cleanWordList()`.

In [None]:
corpusCleanWords = []

for filename in sciFiles:
    path = sciDir+"/"+filename
    fileWords = wordList(path)
    cleanWords = cleanWordList(fileWords)
    corpusCleanWords.extend(cleanWords)

print("Cleaned corpus word count:", len(corpusCleanWords))

Cleaned corpus word count: 11812


Remember that we can call functions that have been defined anywhere in our program, as long as the function has been defined before we call it. You will often see that data may be sent to many different functions before returning back to the main flow of the program.
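
As a small illustration of data passing through several functions in turn, here is a sketch with three made-up helper functions (`splitWords`, `lowerWords`, and `countWords` are hypothetical names, not part of the lab code). The chain is written two ways: step by step, and as nested calls.

```python
## Hypothetical helper functions, each doing one small job
def splitWords(text):
    return text.split()

def lowerWords(words):
    return [w.lower() for w in words]

def countWords(words):
    return len(words)

demoText = "We chain Functions together"

## Step by step: each output becomes the next input
step1 = splitWords(demoText)
step2 = lowerWords(step1)
print(step2)

## The same chain written as nested calls, read from the inside out
print(countWords(lowerWords(splitWords(demoText))))
```

The step-by-step version is usually easier to debug, because each intermediate result can be printed and checked.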

### **QUESTION:**

> With the new, cleaned word list, what are the 20 most common words in the corpus? (Show your code.)

### **ANSWER:**

> code shown below:

-----

In [None]:
len(corpusCleanWords)
CleanDist = FreqDist(corpusCleanWords)
print("The 20 most common words in the cleaned corpus are:")
print(CleanDist.most_common(20))

The 20 most common words in the cleaned corpus are:
[('We', 200), ('model', 133), ('models', 94), ('approach', 93), ('show', 86), ('translation', 83), ('word', 78), ('In', 73), ('language', 72), ('The', 71), ('data', 69), ('semantic', 64), ('using', 63), ('results', 63), ('features', 61), ('task', 59), ('paper', 57), ('Our', 54), ('learning', 54), ('method', 51)]
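
Capitalized words such as 'We', 'In', and 'The' survive in this list, presumably because the stopword list is lowercase and the comparison in `cleanWordList()` is case-sensitive. One possible adjustment is to lowercase each word before checking it. The sketch below uses a tiny hypothetical stopword set (`demoStops`) instead of the NLTK list, and `cleanWordListLower` is an illustrative name:

```python
import string

## Hypothetical, tiny stopword set for illustration only
demoStops = {'we', 'the', 'in', 'a', 'of'}

def cleanWordListLower(words):
    ## compare the lowercased word against the stopword set
    noStops = [w for w in words if w.lower() not in demoStops]
    ## drop tokens made up entirely of punctuation characters
    noPunct = [w for w in noStops
               if not all(ch in string.punctuation for ch in w)]
    return noPunct

demoWords = ['We', 'propose', 'a', 'model', 'of', 'translation', '.',
             'The', 'model', ',']
print(cleanWordListLower(demoWords))
```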


# 2b. Error handling and debugging

**NOTE:** This section of Lab 6 is primarily reading, with just a couple of exercises.

Debugging is the process of modifying your code until it works properly, and it is a *constant companion* to writing code. Some programmers (e.g. my partner) estimate that programming is something like 20% writing new code and 80% testing and debugging. It's something you'll get better and better at over time, as you start to recognize common errors and error patterns. 

*Think Python* has a short section about debugging at the end of each chapter with tips and strategies for debugging. There's a *lot* of useful information in these sections! In particular, I suggest reading the debugging sections for chapters 2 (sec. 2.8), 3 (sec. 3.12), 5 (sec. 5.12), 6 (sec. 6.9), 7 (sec. 7.7), and 8 (sec. 8.11). It sounds like a lot of reading, but in fact each section shouldn't take more than a few minutes.

Two highly-recommended strategies for debugging:

* `print` statements
* `assert` statements

`print` statements should be used to check your variables and data structures at different points in the process, so that you can see whether the code has actually done what you expected it to do (very often the answer is no, and these discrepancies help us pinpoint errors in our code). 

Using lots of `print` statements can also help to find exactly where something is going wrong. At an extreme, you can `print` a variable before and after each statement that does something to the variable. For example:




In [None]:
mytext = '''If there is a semantic error in your program, it will run without generating error messages, but it will not do the right thing. It will do something else.'''

### first we'll see whether split() is doing what's expected
print(mytext)
mylist = mytext.split()
print(mylist)

### it looks okay, except for things like the comma after 'program'
### let's try removing word-final punctuation - remember rstrip()?
### we'll then print the new version of the list

newList = [w.rstrip(string.punctuation) for w in mylist]
print(newList)

### good, that got rid of punctuation after words
### now how about stopwords?

thirdList = [w for w in newList if w not in enStops]
print(thirdList)


If there is a semantic error in your program, it will run without generating error messages, but it will not do the right thing. It will do something else.
['If', 'there', 'is', 'a', 'semantic', 'error', 'in', 'your', 'program,', 'it', 'will', 'run', 'without', 'generating', 'error', 'messages,', 'but', 'it', 'will', 'not', 'do', 'the', 'right', 'thing.', 'It', 'will', 'do', 'something', 'else.']
['If', 'there', 'is', 'a', 'semantic', 'error', 'in', 'your', 'program', 'it', 'will', 'run', 'without', 'generating', 'error', 'messages', 'but', 'it', 'will', 'not', 'do', 'the', 'right', 'thing', 'It', 'will', 'do', 'something', 'else']
['If', 'semantic', 'error', 'program', 'run', 'without', 'generating', 'error', 'messages', 'right', 'thing', 'It', 'something', 'else']


Another useful strategy is to write an `assert` statement. `assert` is something like a shorthand way of writing an `if`-condition. If the condition following `assert` is true, nothing happens. If the condition is false, Python raises an `AssertionError` and, unless we handle it, the code stops running.

In [None]:
### in this example, we want to check whether our list is shorter than 20 items
assert len(thirdList) < 20

Because the condition (`len(thirdList) < 20`) is true, nothing happens. Let's see what happens with a false condition.

In [None]:
### in this example, we want to check whether our list is longer than 20 items
assert len(thirdList) > 20

And we get an error message. `assert` statements can be helpful if we need to make sure some condition is met before proceeding.
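
An `assert` statement can also carry an optional message, written after a comma, which makes the error easier to interpret. A minimal sketch with a made-up list (`demoList` is hypothetical); the failing `assert` is wrapped in `try`/`except` only so that the message can be inspected without stopping the cell:

```python
demoList = ['semantic', 'error', 'program']

## The optional message after the comma is shown if the condition is false
assert len(demoList) < 20, "list is longer than expected"

## Catching the AssertionError here only so that we can look at its message
try:
    assert len(demoList) > 20, "list is shorter than expected"
except AssertionError as err:
    message = str(err)
    print("Assertion failed:", message)
```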

### Exceptions

Exceptions are error messages - they are the responses that Python gives us when there's a problem with our code. Exceptions are very useful in communicating that there are syntax errors or runtime errors, but they can be frustrating too. When we encounter an exception that isn't handled, the code halts execution - the program stops running. 

**READING:** Python Crash Course, chapter 10, pages 200-207.

The `try`/`except` structure allows us to handle exceptions without the code crashing to a halt. 
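
Here is a small sketch of that structure, using a made-up function (`averageLength` is a hypothetical name) that would normally crash with a `ZeroDivisionError` when given an empty list. The `else` block runs only if no exception was raised.

```python
def averageLength(lengths):
    ## Dividing by len(lengths) raises ZeroDivisionError for an empty list
    try:
        result = sum(lengths) / len(lengths)
    except ZeroDivisionError:
        print("No texts to average over.")
        return None
    else:
        return result

print(averageLength([134, 91, 65]))
print(averageLength([]))
print("The program is still running.")
```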

### **QUESTION:**

> a. Exercise 10-6 from *Python Crash Course*: One common problem when prompting for numerical input occurs when people provide text instead of numbers. When you try to convert the input to an `int`, you'll get a `ValueError`. Write a program that prompts the user to input two numbers. Add them together and print the result. Catch the `ValueError` if either input value is not a number, and print a friendly error message. Test your program by entering two numbers and then by entering some text instead of a number.

> b. Exercise 10-8 from *Python Crash Course*: Make two files, *cats.txt* and *dogs.txt*. Store at least three names of cats in the first file and three names of dogs in the second file. Write code that tries to read these files and print the contents of the file to the screen. Wrap your code in a `try`-`except` block to catch the `FileNotFoundError`, and print a friendly message if a file is missing. Change the name of one of the files, and make sure the code in the `except` block executes correctly.

### **ANSWER:**

code shown below:

-----

In [None]:
try:
  first = input("Give me a number:")
  second = int(first)
  third = input("And another number:")
  fourth = int(third)

except ValueError:
  print("Sorry, that's not what I'm looking for. Please enter a number.")

else:
  ## add the converted integers, not the original input strings
  adding = second + fourth
  print("The sum of " + first + " and " + third + " equals " + str(adding))


Give me a number:1
And another number:2
The sum of 1 and 2 equals 3


In [None]:
## create the two files; each name is written with a newline in front of it
with open('cats.txt', 'w') as new_file:
  new_file.writelines(['\nbella', '\ntom', '\nhoney'])

with open('dogs.txt', 'w') as second_file:
  second_file.writelines(['\nrocky', '\nmilo', '\nzenki'])

## 'birds.txt' was never created, so it should trigger the except block
filenames = ['birds.txt', 'dogs.txt']

for filename in filenames:
  print("\nOpening file: " + filename)
  try:
    with open(filename, 'r') as f:
      contents = f.read()
      print(contents.upper())
  except FileNotFoundError:
    print("Sorry that file doesn't exist.")
     


Opening file: birds.txt
Sorry that file doesn't exist.

Opening file: dogs.txt

ROCKY
MILO
ZENKI


### HOORAY! 
Congratulations, you've finished Lab 6! Take a breather now before you move on to Lab 7. There are fun and exciting things ahead!

# 3. Classes and objects in Python

So far we have been writing code mostly in a way that is usually described as **procedural programming**. In this style of programming, the code consists of a sequence of actions (statements, procedures) to be performed in a specified manner. We accomplish things largely by building data structures, like lists or dictionaries, and then doing things with or to those data structures.

In this lab, we'll take a first look at another style of programming called **object-oriented programming (OOP)**. Python allows for both procedural and object-oriented programming, and very often the most effective code is some combination of the two. For today, we will simply learn about two fundamental concepts of OOP, **classes** and **objects**. 

A **class** is similar to a function in that it must be defined before it can be used. One way to think of a class is as a blueprint - it defines a category that is relevant for the program at hand. A class definition can then be used to create **objects** - each object is an **instance** of the class. The class definition specifies properties of the objects which belong to that class, as well as functions or methods that are specific to the class.

It might be helpful to think about an analogy from the non-digital world. If we applied this approach to objects in the real world, we could think about (for example) the class of *birds*. There are many subclasses of birds, like cardinals, herons, penguins, or egrets, and then many individual bird objects which are instances of the class. The bird class definition might specify properties like having claws, laying eggs, or flying, and we'd expect the individual bird instances to have these properties. There are also properties like size or diet that need to be specified for individual bird objects.
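
The bird analogy could be sketched in Python roughly as follows. This is only an illustration: the class name `Bird`, its attributes, and the `describe()` method are made up for this example, and subclasses are left out to keep it short.

```python
class Bird:
    ## Shared by every bird: a property defined once, on the class
    lays_eggs = True

    def __init__(self, species, diet):
        ## Specific to each individual bird object
        self.species = species
        self.diet = diet

    def describe(self):
        return self.species + " eats " + self.diet

## Two objects (instances) created from the same class
heron = Bird("heron", "fish")
cardinal = Bird("cardinal", "seeds")

print(heron.describe())
print(cardinal.lays_eggs)
```

Each instance carries its own `species` and `diet`, while `lays_eggs` is defined once and shared by all instances.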

With this analogy in mind, turn now to chapter 9 of *Python Crash Course* for a very clear, detailed, and specific discussion of classes and objects.

**READING:** Python Crash Course, Chapter 9, pages 161-171

Try out the code examples as you read the chapter, and the exercises below (from the text). In the next section, we'll apply this knowledge to text.

### **QUESTION:**

> a. Exercise 9-1 from Python Crash Course (modified). Make a class called `Book`. The `__init__()` method for `Book` should store two attributes: a `title` and an `author`. Make a method called `book_info()` that prints these two pieces of information, and a method called `book_available()` that prints a message indicating that the book is available in the library.

> b. Create an instance called `book1` from your class. Print the two attributes individually, and then call both methods.

> c. Exercise 9-2 from Python Crash Course (modified). Create three different instances from the class, and call `book_info()` for each instance.

> d. Exercise 9-4 from Python Crash Course (modified). Add an attribute called `number_reads` with a default value of 0. Create a new instance called `book5` from this class. Print the number of times the book has been read, and then change this value and print it again.

> e. Add a method called `set_number_reads()` that lets you set the number of times the book has been read. Call this method with a new number and print the value (`number_reads`) again.

> f. Add a method called `increment_number_reads()` that lets you increment the number of times the book has been read. Call this method with any number you like.

### **ANSWER:**

code shown below:

-----

In [None]:
class Book():
  def __init__(self, title, author):
    self.title = title.title()
    self.author = author.title()
    self.number_reads = 0

  def book_info(self):
    msg = self.title + " written by " + self.author
    print("\n" + msg)

  def book_available(self):
    msg = self.title + " is available at the library."
    print("\n" + msg)
  
  def set_number_reads(self, number_reads):
    self.number_reads = number_reads

  def increment_number_reads(self, additional_reads):
    self.number_reads += additional_reads

book1 = Book('in cold blood', 'truman capote')
print(book1.title)
print(book1.author)

book1.book_info()
book1.book_available()

book5 = Book('in our time', 'ernest hemingway')
book5.book_info()

print("\nNumber Read: " + str(book5.number_reads))

book5.number_reads=9
print("Number Read: " + str(book5.number_reads))

book5.set_number_reads(12)
print("Number Read:" + str(book5.number_reads))

book5.increment_number_reads(3)
print("Number Read:" + str(book5.number_reads))


In Cold Blood
Truman Capote

In Cold Blood written by Truman Capote

In Cold Blood is available at the library.

In Our Time written by Ernest Hemingway

Number Read: 0
Number Read: 9
Number Read:12
Number Read:15


In [None]:
truman = Book('in cold blood', 'truman capote')
king = Book('carrie', 'stephen king')
plath = Book('the bell jar', 'sylvia plath')

truman.book_info()
king.book_info()
plath.book_info()


In Cold Blood written by Truman Capote

Carrie written by Stephen King

The Bell Jar written by Sylvia Plath


# 4. Text preprocessing with spaCy

And now we get to the big payoff, the really exciting stuff: preprocessing and using the spaCy toolkit (https://spacy.io/). spaCy is a Python library for natural language processing (NLP) that is easy to install, easy to use, and super fast and powerful to boot. It's widely used, both in research and in industry. There are many, many useful functions available through spaCy, and our focus for today is a set of methods for preprocessing text.

As the name suggests, preprocessing is the stuff we do *before* processing - these are steps we take to get raw text ready for more complicated processing steps. Today we will look at four different preprocessing steps: tokenization, sentence splitting, lemmatization, and part-of-speech (POS) tagging.

Though each of these steps seems quite simple to perform, that's because we are looking at them through the eyes of a human speaker of language. The tasks performed in preprocessing are not at all trivial for a machine, and doing them automatically will certainly lead to some errors and some noise in the data. At the same time, the number of mistakes is usually small enough that it doesn't negatively affect the outcome of the later processing steps (at least for English).

### Why preprocessing matters

The tasks that we perform in preprocessing involve a number of decisions. For example, if we come across the word *didn't*, do we want to treat that as one token or two? If we treat it as just one token, we're saying that *did* and *didn't* are two different word types, and we are ignoring the fact that the *-n't* at the end of *didn't* has the same meaning as that same suffix on other words like *can't* or *wouldn't*. If we then treat it as two tokens, how do we represent the second token - is it *n't* or *-n't* or *not*? Some of these decisions don't have a huge impact on our processing of text, but it is **essential** that we make the same decision every time we see a word ending in *-n't*.

Handling preprocessing carefully leads to better, more consistent analyses with more reliable results. 
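
To make the point about consistency concrete, here is a small sketch that uses only Python's built-in `re` module. The function name `split_nt` and the decision to write the second token as `n't` are assumptions made for this illustration; this is not how spaCy's tokenizer is implemented.

```python
import re

def split_nt(word):
    """Split a word ending in n't into two tokens; leave other words alone.

    Hypothetical helper: it applies one decision (stem + "n't")
    the same way to every word that ends in n't.
    """
    match = re.fullmatch(r"(\w+)(n't)", word)
    if match:
        return [match.group(1), match.group(2)]
    return [word]

for word in ["didn't", "wouldn't", "can't", "did"]:
    print(word, "->", split_nt(word))
```

Notice that this rule turns *can't* into `ca` plus `n't`. That may look odd, but the second token is the same `n't` every time, which is the property that matters for later counting.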

### Preprocessing is largely language-specific

Many preprocessing steps rely on language-specific knowledge. For example, we can (mostly) use white space to separate words in English, but there are (usually) no spaces between words in Mandarin Chinese. Libraries for doing preprocessing need to keep in mind the properties of the language that the text is written in. (And that's assuming our texts are written in only one language!)
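
A quick way to see the white space point is to try Python's built-in `str.split()` on a short sentence in each language. The two example sentences are my own (both mean roughly "I like cats"); they are not taken from the lab corpus.

```python
english = "I like cats"
mandarin = "我喜欢猫"  # "I like cats", written without spaces between words

# str.split() with no arguments splits on runs of white space
print(english.split())
print(mandarin.split())
```

The English sentence comes back as three separate strings, while the Mandarin sentence comes back as a single unsplit string, so a white space rule alone gives us no word boundaries there.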

Now let's look at each of these steps in turn. For each one, we'll read a little bit about the motivation for that step, we'll see some examples, and we'll see how to use spaCy to get it done.

## Preliminaries: import spaCy and an English-language model (from spaCy)

Just like NLTK, spaCy needs to be imported as a Python library. After importing `spacy`, our next step is to load a model. Here we're using the smaller version of the core English language model, trained on web data. 

To learn about other English models available through spaCy: https://spacy.io/models/en -- there are three sizes: small, medium, and large. There are models for some other languages too: https://spacy.io/usage/models

Loading the model creates a `Language` object (that's right, an instance of the spaCy-defined class `Language`) that contains all of the data and the components needed to process text in that language. By convention, we usually assign this object to the variable `nlp`, because this is the object that we'll use to process texts.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")


**READING:** Speech and Language Processing:

Chapter 2: https://web.stanford.edu/~jurafsky/slp3/2.pdf

* sections 2.4-2.4.2 (tokenization)
* section 2.4.5 (sentence splitting)
* section 2.4.4 (lemmatization)

Chapter 8: https://web.stanford.edu/~jurafsky/slp3/8.pdf

* section 8.1 (word classes in English)
* section 8.2 (the Penn Treebank tagset)

## Tokenization & sentence splitting

Two of the most essential preprocessing steps are **tokenization** and **sentence splitting**. Tokenization involves splitting a long string of text into its individual tokens (including splitting off punctuation), and sentence splitting is just what it sounds like - dividing that string of text into sentences. To better understand how important these are, imagine that we didn't have these kinds of divisions. Any text - whether it's a haiku, Moby Dick, or the entire Bible - would just be one long string, and viewed by the computer as just one long sequence of characters, with a bunch of spaces and other white space. Breaking the text into sentences, and then into tokens, adds structure to otherwise unstructured text. 

For both tokenization and sentence splitting, the way that English text is written provides some pretty strong cues, but they're not perfect.

### **QUESTION:**

> a. What is the main cue for splitting English documents into sentences? When and how might it go wrong? (No coding required.)

> b. What is the main cue in English for splitting sentences into tokens? When and how might it go wrong? (No coding required.)

### **ANSWER:**

> a. I think the main cue for sentence segmentation is punctuation (periods, question marks, exclamation points, etc.), which usually signals the end of a sentence. A common problem is the period: it is ambiguous, because it can mark either a sentence boundary or an abbreviation.

> b. Splitting sentences into tokens is similarly cued by punctuation, specifically periods. Deciding whether a period marks an abbreviation or the end of the sentence would be the first step in tokenizing.

-----




To process new text, we give the text (as a string) as an argument to the `nlp` object. It will return a processed version of the document. The processed document is an object of spacy's `Doc` class and offers many useful attributes and functionalities. As part of this processing, the text is split into tokens (we would say 'the text gets tokenized'), which we can see by iterating through the tokens of the document.

In [None]:
mystring = "Denton, Texas has been my home for four years. I'm a professor at the Univ. of North Texas, in the Linguistics Department. My cat's name is Bella, and you have all seen her on zoom."

mydoc = nlp(mystring)
for token in mydoc:
    print(token.text)

### **QUESTION:**

> What do you notice about how spacy tokenizes punctuation? Anything that surprises you? (No coding required.)

### **ANSWER:**

> I've worked with spacy tokenization before so no surprises here.

-----

Similarly, the document has already been split into sentences.

In [None]:
sentences = list(mydoc.sents)
for s in sentences:
    print(s.text)

Denton, Texas has been my home for four years.
I'm a professor at the Univ.
of North Texas, in the Linguistics Department.
My cat's name is Bella, and you have all seen her on zoom.


### **QUESTION:**

> Any problems with the sentence splitting?

### **ANSWER:**

> Yes. Deciding whether a period marks an abbreviation or the end of a sentence is again the typical problem: the period in the abbreviation *Univ.* is treated as a sentence boundary, so one sentence is split into two.

-----
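
The problem with *Univ.* described in the answer above is easy to reproduce with a deliberately naive splitter. The sketch below uses only Python's `re` module, and its rule (a sentence ends at `.`, `!` or `?` followed by white space) is an assumption chosen for illustration; spaCy's own sentence splitting works differently.

```python
import re

text = ("I'm a professor at the Univ. of North Texas, in the "
        "Linguistics Department. My cat's name is Bella.")

# naive rule (an assumption for illustration): a sentence ends at
# ., ! or ? when that character is followed by white space
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for sentence in naive_sentences:
    print(sentence)
```

The naive rule breaks the text after *Univ.* as well as after *Department.*, because nothing in the rule distinguishes an abbreviation from the end of a sentence.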

## Lemmatization

For each token that spaCy identifies, it also determines the lemma, or base form of the word. Just like `text`, the lemma is an attribute of the token. For many of the token attributes (lemma, pos, tag, etc.) we need to add an underscore after the name of the attribute. The version with the underscore (`lemma_`) gives the human-readable string; the version without it (`lemma`) gives spaCy's internal representation, an integer ID. (Try without the _ and see what happens!)

In [None]:
for token in mydoc:
    print(token.text, token.lemma_)

Denton Denton
, ,
Texas Texas
has have
been be
my -PRON-
home home
for for
four four
years year
. .
I -PRON-
'm be
a a
professor professor
at at
the the
Univ Univ
. .
of of
North North
Texas Texas
, ,
in in
the the
Linguistics Linguistics
Department Department
. .
My -PRON-
cat cat
's 's
name name
is be
Bella Bella
, ,
and and
you -PRON-
have have
all all
seen see
her -PRON-
on on
zoom zoom
. .


### **QUESTION:**

> What do you see that's interesting in the lemmas associated with the tokens? Describe at least two different cases where the text and the lemma are different (for many words, the two are identical).

### **ANSWER:**

> For most of these the text and lemma are identical. However, for pronouns the lemma is just the placeholder `-PRON-`, since there is no clear base form for pronouns in English. Another difference is with plurals (years, year), which spaCy seems to handle fine.

-----

## Part-of-speech (POS) tagging

The part-of-speech of a word captures a lot of information about how the word behaves. Sometimes it can be very useful to be able to distinguish nouns from verbs, or adjectives from nouns. Other times, we may want to count the relative proportion of certain POS categories. For example, authors may differ in how descriptive their texts are, and we might try to measure that by looking at the relative proportion of adjectives in a text.
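
As a rough sketch of the adjective idea, the code below computes the proportion of adjectives in two tiny token lists. The lists are hand-tagged for illustration (they are not spaCy output), and the function name `adjective_proportion` is hypothetical; the labels follow the style of spaCy's coarse-grained tags.

```python
# (text, coarse POS) pairs in the style of spaCy's coarse-grained tags;
# these two tiny examples are hand-tagged, not produced by spaCy
plain = [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"), (".", "PUNCT")]
descriptive = [("The", "DET"), ("sleepy", "ADJ"), ("gray", "ADJ"),
               ("cat", "NOUN"), ("sat", "VERB"), (".", "PUNCT")]

def adjective_proportion(tagged_tokens):
    """Return the share of non-punctuation tokens tagged ADJ."""
    words = [pos for _, pos in tagged_tokens if pos != "PUNCT"]
    if not words:
        return 0.0
    return words.count("ADJ") / len(words)

print(adjective_proportion(plain))
print(adjective_proportion(descriptive))
```

With a processed spaCy document, the same kind of list could be built from each token's text and coarse-grained POS string.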

For each token, spaCy provides two different POS tags. `pos_` is a coarse-grained, very general part of speech. `tag_` is a fine-grained, more specific part of speech. For example, `VERB` is a coarse-grained part of speech, and the fine-grained tags associated with verbs (from the Penn Treebank tagset) specify things like tense (past vs. present), person (third person singular vs. other), and form (base form, gerund, past participle).

In [None]:
for token in mydoc:
    print(token.text, token.pos_, token.tag_)

In [None]:
### spacy offers a method for getting explanations of any POS tag
print(spacy.explain('PART'))
print(spacy.explain('POS'))

particle
possessive ending


## Other useful functions in spaCy

### Noun chunks:

spaCy can produce a list of noun chunks in a text:



In [None]:
noun_chunks = list(mydoc.noun_chunks)
for chunk in noun_chunks:
    print(chunk.text)

### Named entities

The document produced by spaCy includes automatic recognition and classification of named entities in the text. There are many different types of named entities, from language names to works of art, to people, organizations, and geo-political entities (like countries, cities, states). You can see the named entities by iterating through the relevant attribute (`ents`) of the document, or by using spaCy's visualizer (which is called, cutely, *displacy*).

In [None]:
for ent in mydoc.ents:
    print(ent.text, ent.label_)

Denton GPE
Texas GPE
four years DATE
North Texas LOC
the Linguistics Department ORG
Bella PERSON
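
Once you have the entities, a natural next step is to count how many there are of each type. The sketch below does this with `Counter` from Python's standard library, using the (text, label) pairs copied by hand from the output above; the variable names are my own.

```python
from collections import Counter

# (text, label) pairs copied from the named-entity output above
entities = [("Denton", "GPE"), ("Texas", "GPE"), ("four years", "DATE"),
            ("North Texas", "LOC"), ("the Linguistics Department", "ORG"),
            ("Bella", "PERSON")]

# count how many entities carry each label
label_counts = Counter(label for _, label in entities)
print(label_counts.most_common())
```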


In [None]:
from spacy import displacy
displacy.render(mydoc, style="ent", jupyter=True)

Finally, each token in the processed document is realized as an instance of the `Token` class, and there are a large number of attributes available through that class. Some of them are illustrated below, and more of them are described here: https://spacy.io/api/token#attributes

In [None]:
### first we'll pick one token to look at
denton = mydoc[0]

print("Text:", denton.text)
print("Lemma:", denton.lemma_)
print("Coarse-grained POS tag:", denton.pos_)
print("Fine-grained POS tag:", denton.tag_)
print("Word shape:", denton.shape_)

### some attributes are Boolean
print("Alphabetic characters?", denton.is_alpha)
print("Punctuation mark?", denton.is_punct)
print("Digit?", denton.is_digit)
print("Like a number?", denton.like_num)
print("Is it a stopword?", denton.is_stop)


Text: Denton
Lemma: Denton
Coarse-grained POS tag: PROPN
Fine-grained POS tag: NNP
Word shape: Xxxxx
Alphabetic characters? True
Punctuation mark? False
Digit? False
Like a number? False
Is it a stopword? False


This is just the start of what we'll do with spaCy. There's a *huge* amount of documentation on the spaCy website. If you're eager to learn more right away, I'd suggest these two links as good places to start:

 * spaCy 101 (free interactive course): https://spacy.io/usage/spacy-101
 * more on linguistic features: https://spacy.io/usage/linguistic-features



# 5. Putting it together

TIP: You may want to first write pseudocode for this section of the lab.

For this final section of the lab, you will pick out 10 documents from the mini-corpus and run them through spaCy. Using spaCy functions (most of these are already built in to spaCy), do the following tasks:

* Count the number of sentences in the subcorpus (your 10 documents)
* Print all of the noun chunks in your subcorpus
* Print all of the named entities in your subcorpus

Do you see anything unexpected in the output?

Next, make a list of all of the lemmas in your corpus. Build a frequency distribution from this list. What are the top 10 most-frequent lemmas? 
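
If you'd like to see the general shape of a frequency count before writing your own, here is a minimal sketch. It uses `Counter` from Python's standard library as a stand-in for NLTK's `FreqDist` (both provide a `most_common()` method), and the short list of lemmas is made up by hand rather than taken from the corpus.

```python
from collections import Counter

# a short hand-made list standing in for the lemmas of a real corpus
lemmas = ["the", "model", "be", "train", ".", "the", "model", "use",
          "the", "data", "."]

# count how often each lemma occurs
lemma_counts = Counter(lemmas)
print(lemma_counts.most_common(3))
```

Notice that punctuation marks are counted along with the words. Whether to keep or filter them is one of those preprocessing decisions you'll need to make for your own list.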


TIP: once you have created a list of filenames (I've called mine `mytexts`), you can use this code to convert the ten texts to one big string. This string will be the input to the `nlp()` function. 

Click 'SHOW CODE' to see the code (I've hidden the block, in case people want to work this out on their own first). 

In [None]:
#@title
# first create a list of texts - I've selected the first 10 files
# from sciDTB-test
mytexts = sciFiles[:10]

# now we'll read each text in, one at a time
# we are creating a string (corpusString) to hold the text of
# the selected files - each file gets read in as one big string
# and added to corpusString
#
# we use the variable charCount to collect a count of the characters
# (not `sum`, which would hide Python's built-in sum() function)

corpusString = ''
charCount = 0
for text in mytexts:
    this = sciDir+"/"+text
    with open(this) as f:
        thisString = f.read()
    corpusString = corpusString + thisString
    charCount = charCount + len(thisString)

print(corpusString)
print(len(corpusString))
print(charCount)


Automatic metrics are widely used in machine translation as a substitute for human assessment . With the introduction of any new metric comes the question of just how well that metric mimics human assessment of translation quality . This is often measured by correlation with human judgment . Significance tests are generally not used to establish whether improvements over existing methods such as BLEU are statistically significant or have occurred simply by chance , however . In this paper , we introduce a significance test for comparing correlations of two metrics , along with an open-source implementation of the test . When applied to a range of metrics across seven language pairs , tests show that for a high proportion of metrics , there is insufficient evidence to conclude significant improvement over BLEU . 
This paper presents neural probabilistic parsing models which explore up to thirdorder graph-based parsing with maximum likelihood training criteria . Two neural network extens

In [None]:
new_texts = sciFiles[20:30]
corpus_string = ''
sum = 0
for text in new_texts:
  path = sciDir+"/"+text
  with open(path) as f:
    new_string = f.read()
  corpus_string = corpus_string + new_string
  sum = sum + len(new_string)

# print(len(new_texts))
print(corpus_string)
print("Number of characters:", sum)

newdoc = nlp(corpus_string)
# count sentences from the processed Doc; len(corpus_string) counts characters
print("Number of sentences:", len(list(newdoc.sents)))
nc = list(newdoc.noun_chunks)
print("Noun Chunks in the subcorpus:")
print(nc)

print()
for ent in newdoc.ents:
  print(ent.text, ent.label_)

print()
lemmas = []
for token in newdoc:
  lemmas.append(token.lemma_)
# print(len(lemmas))
from nltk import FreqDist  # needed if FreqDist was not imported earlier in the notebook
lemmadist = FreqDist(lemmas)
print("The top 10 most frequent lemmas are:")
lemmadist.most_common(10)

We propose the first probabilistic approach to modeling cross-lingual semantic similarity ( CLSS ) in context which requires only comparable data . The approach relies on an idea of projecting words and sets of words into a shared latent semantic space spanned by language-pair independent latent semantic concepts ( e.g. , cross-lingual topics obtained by a multilingual topic model ) . These latent cross-lingual concepts are induced from a comparable corpus without any additional lexical resources . Word meaning is represented as a probability distribution over the latent concepts , and a change in meaning is represented as a change in the distribution over these latent concepts . We present new models that modulate the isolated out-of-context word representations with contextual knowledge . Results on the task of suggesting word translations in context for 3 language pairs reveal the utility of the proposed contextualized models of cross-lingual semantic similarity . 
Left-to-right ( L

[('.', 50),
 ('the', 43),
 ('-', 42),
 (',', 42),
 ('-PRON-', 30),
 ('of', 30),
 ('a', 30),
 ('to', 28),
 ('be', 28),
 ('and', 22)]

# 6. REMINDER: Wrapping up and submitting

Create a revision by going to **File > Save and Pin Revision**.

View your revision history at **File > Revision History**.

To submit: go to **Share** in the upper right corner, click **Get Shareable Link**, change the dropdown menu option to **Anyone with the link can edit**, and then **Copy Link**! This is what you'll submit on Canvas.